Abstract: Accurate peach detection is a prerequisite for automated agronomic management, e.g., mechanical peach harvesting. However, uneven illumination and ubiquitous occlusion make peaches difficult to detect, especially when they are bagged in orchards. To this end, this paper proposes an accurate multi-class peach detection method for mechanical harvesting that improves YOLOv5s and uses multi-modal visual data. An RGB-D dataset with multi-class annotations of naked and bagging peaches was constructed, including 4127 multi-modal images of corresponding pixel-aligned color, depth, and infrared images acquired with a consumer-level RGB-D camera. Subsequently, an improved lightweight YOLOv5s (small depth) model was put forward by introducing a direction-aware and position-sensitive attention mechanism, which captures long-range dependencies along one spatial direction and preserves precise positional information along the other, helping the network accurately detect peach targets. Meanwhile, depthwise separable convolution was employed to reduce the model computation by decomposing the convolution operation into convolution in the depth direction and convolution in the width and height directions, which sped up the training and inference of the network while maintaining accuracy. Comparison experiments demonstrated that the improved YOLOv5s using multi-modal visual data achieved detection mAPs of 98.6% and 88.9% on naked and bagging peaches, respectively, with 5.05 M model parameters in complex illumination and severe occlusion environments. These values were 5.3% and 16.5% higher than those obtained using only RGB images, and 2.8% and 6.2% higher than those of the original YOLOv5s. Compared with other networks in detecting bagging peaches, the improved YOLOv5s performed best in terms of mAP, which was 16.3%, 8.1%, and 4.5% higher than that of YOLOX-Nano, PP-YOLO-Tiny, and EfficientDet-D0, respectively. In addition, the improved YOLOv5s model outperformed other methods to varying degrees in detecting Fuji apple and Hayward kiwifruit, which verified its effectiveness on different fruit detection tasks. Further investigation revealed the contribution of each imaging modality, as well as of the proposed improvements to YOLOv5s, to the favorable detection results for both naked and bagging peaches in natural orchards. Additionally, on a popular mobile hardware platform, the improved YOLOv5s model performed 19 detections per second with the five-channel multi-modal images, offering real-time peach detection. These promising results demonstrated the potential of the improved YOLOv5s and multi-modal visual data with multi-class annotations to achieve visual intelligence for automated fruit harvesting systems.

1 Introduction


Peach is the third most productive temperate tree species behind apple and pear, and is an excellent source of vitamin C. Peach harvesting is a time-consuming, challenging task that is highly dependent on labor. Efficient and easy harvesting methods are required to meet the fruit needs of a growing global population and improve orchard productivity. Research on and improvements in automated technology such as mechanical harvesting have provided farmers with a practical approach to increasing production. Traditional methods typically use segmentation algorithms or features such as color, shape, or texture to detect certain types of fruit. In orchards, peach detection is challenging because of occlusion by the branches and leaves of peach trees, as well as ever-changing illumination, which makes it difficult to accurately detect peaches using traditional methods.

The first task in automated peach harvesting is to detect peaches accurately. In-field fruit detection has been widely applied to a variety of fruits. However, most of the images used by traditional methods were acquired under controlled illumination, which makes these methods vulnerable to complex orchard environments. Additionally, other environmental factors, such as the changing appearance and size of fruits, can also critically affect detection accuracy. Compared with traditional methods, deep learning adapts well to differences within a working scene and has become one of the most promising techniques for learning image features. Deep learning algorithms have therefore been increasingly used in fruit detection for agricultural robots in unstructured environments. However, in real orchards, one of the greatest challenges in fruit detection is the complex orchard environment, such as changing complex backgrounds. Meanwhile, the varying scales of fruit targets also cause substantial difficulties in detecting fruits, especially when the fruits are bagged. Therefore, with the vigorous development of deep learning, the attention mechanism, derived from the imitation of human vision, has been applied to enhance the model's perception ability in complex environments. In recent years, many attention strategies have been widely adopted for various fruit detection tasks. Li et al. introduced a deep learning target detection algorithm based on an improved YOLOv4-tiny that combined an attention mechanism with multi-scale prediction to improve the recognition of occluded and small-target green peppers. Jiang et al. detected young apples efficiently by adding a non-local attention module and a convolutional block attention module (CBAM) to a YOLOv4 model. Huang et al. extended the target detection algorithm by adding CBAM to improve the performance of citrus detection. Overall, these studies demonstrated that the attention mechanism could enhance a model's detection performance and help it adapt to natural environments with complex backgrounds.
Nevertheless, many challenges remain in effectively detecting fruits in practical scenes. In actual agricultural production environments, using RGB images as the only source of information is inadequate when interference factors are present, such as occlusion or overlap of fruits and ever-changing illumination. Fortunately, with the development of consumer-level RGB-D cameras, such as Microsoft Kinect and Intel RealSense, additional information such as depth and infrared data provides extra cues to address these problems. Sa et al. fed RGB and infrared images to Faster R-CNN for sweet pepper identification. Fu et al. developed an outdoor machine vision system with an RGB-D camera to improve apple identification by using depth features to filter out background objects. Arad et al. presented a robot for harvesting sweet pepper fruits in the greenhouse; the robotic system was equipped with an RGB-D camera that acquired color and depth information for detecting and locating each fruit. These studies claimed that introducing modalities in addition to RGB could improve fruit detection performance. However, they mainly focused on detecting naked fruits without severe occlusion and overlapping because of standard planting or fruiting-wall architectures. In fact, bagging late-ripening and high-quality fruits is one of the most popular ways to prevent diseases and extend storage duration. This agronomic measure increases the difficulty of in-field fruit detection because it brings more severe occlusion and irregular target shapes. Therefore, in addition to detecting naked fruits, it is also meaningful to investigate how to detect bagging peaches effectively.

For these reasons, this paper proposes an efficient detection model that uses three-dimensional spatial geometry and backscatter signal intensity information from multi-modal images to detect in-field naked and bagging peaches for guiding mechanical harvesting. More specifically, an RGB-D dataset of naked and bagging peaches was presented, including 4127 corresponding color, depth, and infrared images obtained by the RGB-D camera. According to the fruit picking strategy and field occlusions, the peaches were classified into four classes: un-occluded, occluded by leaves, occluded by fruits, and occluded by branches. An optimized peach detector was then put forward by introducing the coordinate attention mechanism and depthwise separable convolution into YOLOv5s. To evaluate the performance of the improved YOLOv5s using multi-modal images on naked and bagging peach detection and to explore the contribution of each imaging modality to environmental adaptation, extensive experiments were carried out from various aspects. Further investigation revealed the contribution of each imaging modality and of the improved YOLOv5s in alleviating the negative influence of complex illumination and severe occlusion. Besides, the computational time of the proposed detection model could meet the requirements of real-time detection through its successful optimization and deployment on NVIDIA Jetson Nano. This study might provide the possibility and foundation for visual intelligence in mechanical harvesting by utilizing the improved YOLOv5s and multi-modal visual data with multi-class annotations.


2 Materials and Methods 


2.1 Data acquisition 


Image acquisition was conducted using a Microsoft Azure Kinect RGB-D camera (key parameters listed in Table 1), which incorporates an RGB (Red-Green-Blue) sensor and a depth sensor that works on the ToF (Time of Flight) principle. Data were acquired in a peach orchard located in Dawei Town, Hefei City, Anhui Province, China. Two agronomic practices were used in the orchard: naked and bagging peaches. According to the planting methods and ripening period, high-quality and late-ripening peaches were usually bagged with red paper to prevent damage from extreme climate and disease. In contrast, early-ripening peaches tended to be left naked to facilitate harvesting.


Table 1 Key parameters of Azure Kinect DK camera

Feature                                  Parameter
RGB camera resolution/pix                1280 × 720
RGB camera FOV (Field of View)/(°)       90 × 59
Depth camera resolution/pix              640 × 576
Depth camera FOV/(°)                     120 × 120
External dimension/mm                    126 × 103 × 39
Device interface                         USB 3.0
Effective distance/m                     0.25~2.88
Ranging principle                        ToF (Time of Flight)


Fig. 1 illustrates the data acquisition scene, with naked peach trees on the left and bagging peach trees on the right.


Fig. 1 View of the naked and bagging peach visual data acquisition


The RGB-D camera provides three types of data that can be used to locate the peaches: RGB images, IR backscattered intensity images (IR), and depth images (Depth). The image data were collected in peach orchards under sunny and cloudy weather conditions, between 7 a.m. and 9 p.m. from August to September. To capture multi-modal images of peaches under normal illumination, the camera was aimed perpendicular to the sunlight direction. To capture images under strong illumination, the camera's viewing direction was set parallel to the sunlight direction. Multi-modal images were also gathered under artificial illumination at night. Because occlusion affects detection performance, some images of targets with different degrees of occlusion were collected from multiple viewing angles. According to the proportion of the target area occluded by branches and leaves, the occlusion levels were classified into slight occlusion (0%-30% occluded), general occlusion (30%-60% occluded), and severe occlusion (60%-100% occluded). Additionally, to better simulate the changing camera distance during mechanical harvesting, the camera was placed 0.1 m to 1.5 m away from the tree trunk. A distance within 0.3 m between the tree trunk and the camera was considered a close distance, simulating the end-effector approaching the target. A distance from 0.3 m to 1 m was considered an average distance, simulating the camera position for detecting the majority of target fruits. A distance greater than 1 m was considered a far distance, simulating a camera position relatively far away from the target fruits.
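
The occlusion levels and distance ranges described above can be expressed as two small lookup functions. This is only an illustrative sketch: the function names are hypothetical, and the handling of values that fall exactly on a boundary (30%, 60%, 0.3 m, 1 m) is an assumption, since the text does not state which side each boundary belongs to.

```python
def occlusion_level(occluded_fraction):
    """Map the occluded share of a target's area (0.0 to 1.0) to a level name.

    Assumption: a boundary value belongs to the higher level (0.30 -> general).
    """
    if not 0.0 <= occluded_fraction <= 1.0:
        raise ValueError("occluded_fraction must lie in [0, 1]")
    if occluded_fraction < 0.30:
        return "slight"
    if occluded_fraction < 0.60:
        return "general"
    return "severe"


def distance_category(distance_m):
    """Map the camera-to-trunk distance in metres to a distance category.

    Assumption: exactly 0.3 m and exactly 1 m both count as "average".
    """
    if distance_m < 0:
        raise ValueError("distance_m must be non-negative")
    if distance_m < 0.3:
        return "close"
    if distance_m <= 1.0:
        return "average"
    return "far"


if __name__ == "__main__":
    print(occlusion_level(0.45), distance_category(0.5))
```

Such helpers would mainly be useful for grouping test images by scenario when comparing detection results across occlusion levels and camera distances.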

Dedicated software written in C++ was developed to collect and save data automatically. The software drove the RGB-D camera to record in-situ data 5 times per second. Each record contained one RGB image, one infrared image, and one depth image, all pixel-aligned. In total, 4127 pairs of multi-modal images were acquired; examples are shown in Fig. 2.


Fig. 2 Examples of multi-modal images of naked and bagging peaches captured by the RGB-D camera (panel labels: Bagging peaches; Depth (8 bit); IR (8 bit))


2.2 Multi-class peach RGB-D dataset 


Manual annotation was applied after the images were collected. Occlusions among leaves, branches, and fruits are ubiquitous in natural orchards. Therefore, according to the robotic picking strategy and in-field occlusion status, bounding boxes were drawn and the targets were classified into multiple classes to achieve selective picking and prevent damage to the end-effectors or robots. The first class indicated that the peaches were not occluded (referred to as NO in this work); the second class indicated that the peaches were occluded only by leaves (referred to as OL) and not by other peaches or branches; the third class indicated that the peaches were occluded by other peaches (referred to as OF); the fourth class indicated that the peaches were occluded by branches (referred to as OB). During mechanical picking, a collision between the robot arm and branches might damage the robot arm, and picking an OF peach might damage non-target peaches. Therefore, when OB and OF applied simultaneously to the same peach, it was labeled OB. Likewise, when OF and OL applied at the same time, it was labeled OF. For the four annotated classes, the peaches inside white, green, cyan, and brown boxes represented NO, OL, OF, and OB, respectively.
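
The labeling priority described above (branches override fruits, and fruits override leaves) can be sketched as a small resolver. The occluder names "branch", "fruit", and "leaf" and the function name are hypothetical; only the class abbreviations and their precedence come from the text.

```python
# Precedence stated in the text: OB overrides OF, and OF overrides OL.
PRIORITY = ("OB", "OF", "OL")

# Hypothetical occluder names mapped to the class abbreviations used in this work.
OCCLUDER_TO_CLASS = {"branch": "OB", "fruit": "OF", "leaf": "OL"}


def resolve_class(occluders):
    """Return the single class label for a peach given everything occluding it."""
    classes = {OCCLUDER_TO_CLASS[name] for name in occluders}
    for label in PRIORITY:
        if label in classes:
            return label
    return "NO"  # nothing occludes the peach


if __name__ == "__main__":
    print(resolve_class(["leaf", "fruit"]))  # OF takes precedence over OL
```

Resolving each target to exactly one class keeps the annotation single-label, which matches the one-class-per-box output of the detector.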

As shown in Fig. 3, all the peaches were manually labeled with bounding boxes tangent to the peach outlines. In the case of occlusion, a peach whose occluded area was greater than 85%, or a target at the edge of the image with less than 15% of its area in view, was not labeled. After labeling, TXT-format annotation files containing peach class names and bounding box pixel coordinates were generated. The dataset in this research contained a total of 4127 peach images of two types: 2077 naked peach images and 2050 bagging peach images. This dataset has been made publicly available at https://github.com/tsing-luo/Multi-class-peach-RGB-D-dataset.


(a) Naked peaches


(b) Bagging peaches


Fig. 3 Peaches were annotated into four classes, where fruits inside white, green, cyan, and brown boxes were referred to as NO, OL, OF, and OB, respectively


2.3 Improved YOLOv5s network 


The YOLO series has become one of the most popular deep learning frameworks among one-stage detectors and is widely used in target detection tasks. In practical agricultural management, real-time detection is required under the limited computation and storage resources of the hardware, which constrains the size and inference time of fruit detection algorithms. The recently proposed YOLOv5s offers a good trade-off between accuracy and speed, with an inference speed of up to 140 FPS (frames per second). In addition, the weight file of the YOLOv5 model was only 7.2 MB, nearly 90% smaller than that of YOLOv4. As depicted in Fig. 4, YOLOv5s was employed as the basis of the fruit detection model in this research. The model mainly includes three parts: backbone, neck, and prediction head. Its structure was modified by combining the coordinate attention mechanism and depthwise separable convolution in the backbone and neck parts.

In the original YOLOv5s model, CSPDarknet53 was used as the backbone network. However, because of the complex backgrounds in orchards, the target features extracted from the images were easily disturbed, particularly when the weeds and soil were close in color to the leaves and branches of peach trees, causing incorrect detection results. Meanwhile, the shallow feature map extracted from the backbone had a small receptive field that was suitable for detecting small targets, e.g., fruits. Nevertheless, using low-dimensional feature maps to increase the feature information of small targets might introduce a significant amount of background noise, particularly when using multi-modal images, which might further decrease target detection accuracy.

To solve these problems, as shown in Fig. 4 and Fig. 5, the CSP module design in the backbone was modified and the neck part was improved by introducing an efficient attention mechanism known as coordinate attention (CA). CA inherits the benefits of channel attention methods while simultaneously capturing long-range dependencies with precise positional information, suppressing unimportant features and promoting useful ones.

Fig. 4 Overall structure of the improved YOLOv5s model
Note: CSP denotes cross stage partial layer; CA denotes coordinate attention; DSC denotes depthwise separable convolution; SPP denotes spatial pyramid pooling

Fig. 5 Structure diagram of coordinate attention mechanism (corresponding to CA in Fig. 4)

Previous studies have shown that adding coordinate attention to the feature extraction part of the model could enhance the representation of the attention region, while adding an attention mechanism to the neck part could improve the position sensitivity in the detection head, preserving the relative positions between features and thus achieving more accurate detection results. Specifically, given the shallow feature maps as input, a pair of direction-aware feature maps was yielded by using two spatial extents of pooling kernels, (H × 1) and (1 × W), to encode each channel along the horizontal coordinate and the vertical coordinate, respectively. These two transformations also allowed the attention block to capture long-range dependencies along one spatial direction and preserve precise positional information along the other, which helped the network accurately locate the peaches. Then, the feature maps produced by the coordinate information embedding block were concatenated and sent to a shared 1 × 1 convolutional transformation. The feature maps were converted to C/r × 1 × (H + W), where r was the reduction ratio for controlling the block size as in the SE block. After the feature maps passed through the BatchNorm layer and the non-linear activation function, they were split into separate tensors along the spatial dimension. Further 1 × 1 convolutional transformations were utilized to separately transform the horizontal tensor $f^h$ and the vertical tensor $f^w$ to tensors with the same channel number as the input C × H × W. The outputs could be formulated as Equations (1) and (2).

$$g^h = \sigma\left(C_h\left(f^h\right)\right) \quad (1)$$

$$g^w = \sigma\left(C_w\left(f^w\right)\right) \quad (2)$$

where C denotes the convolutional transformation and $\sigma$ is the sigmoid function. Finally, the output Y is written as Equation (3).

$$Y(i, j) = X(i, j) \cdot g^h(i) \times g^w(j) \quad (3)$$
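
The data flow of the coordinate attention block (directional pooling, shared 1 × 1 transformation, split, and re-weighting as in Equations (1) to (3)) can be traced with a minimal pure-Python sketch on nested lists. This is not the implementation used in this work: the BatchNorm layer is omitted, ReLU stands in for the non-linear activation, and the weight matrices `w_shared`, `w_h`, and `w_w` are hypothetical arguments supplied by the caller rather than learned parameters.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def conv1x1(feat, weight):
    """1 x 1 convolution as channel mixing.

    feat: [C_in][L] values, weight: [C_out][C_in] -> returns [C_out][L].
    """
    length = len(feat[0])
    return [
        [sum(w_row[c] * feat[c][p] for c in range(len(feat))) for p in range(length)]
        for w_row in weight
    ]


def coordinate_attention(x, w_shared, w_h, w_w):
    """x: [C][H][W]; w_shared: [C/r][C]; w_h and w_w: [C][C/r]."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    # Pooling kernel (H x 1) as described in the text: average over the width.
    z_h = [[sum(x[c][i]) / W for i in range(H)] for c in range(C)]
    # Pooling kernel (1 x W): average over the height.
    z_w = [[sum(x[c][i][j] for i in range(H)) / H for j in range(W)] for c in range(C)]
    # Concatenate along the spatial dimension -> [C][H + W], then shared 1 x 1 conv.
    f = conv1x1([z_h[c] + z_w[c] for c in range(C)], w_shared)
    # Simplification: ReLU only, no BatchNorm.
    f = [[max(v, 0.0) for v in row] for row in f]
    # Split back into the two directional tensors f^h and f^w.
    f_h = [row[:H] for row in f]
    f_w = [row[H:] for row in f]
    g_h = [[sigmoid(v) for v in row] for row in conv1x1(f_h, w_h)]  # Equation (1)
    g_w = [[sigmoid(v) for v in row] for row in conv1x1(f_w, w_w)]  # Equation (2)
    # Equation (3): re-weight every position by its row and column attention.
    return [
        [[x[c][i][j] * g_h[c][i] * g_w[c][j] for j in range(W)] for i in range(H)]
        for c in range(C)
    ]
```

With all weights set to zero, both attention maps equal 0.5 everywhere, so the output is the input scaled by 0.25, which is a convenient sanity check for the wiring.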

Additionally, because of the limited hardware resources in practical agricultural management, the size and computational cost of the fruit detection model needed to be optimized in addition to improving detection accuracy, which was critical for facilitating its deployment on in-field harvesting robots. As shown in Fig. 4 and Fig. 6, depthwise separable convolution (DSC) was introduced to substitute part of the regular convolutions in the backbone and neck networks, reducing the model parameters and the inference time without penalizing accuracy. The DSC was a combination of depth-wise convolution and point-wise convolution. The depth-wise convolution contained $c_1$ convolution kernels of size $h \times w \times 1$ and performed the filtering by acting on each channel. The point-wise convolution contained $c_2$ convolution kernels of size $1 \times 1 \times c_1$ and performed the channel conversion by acting on the output feature map of the depth-wise convolution. Therefore, the parameters of depthwise separable convolution and traditional convolution were given by Equations (4) and (5).

$$P_{DSC} = c_1 \times (h \times w \times 1) + c_2 \times (1 \times 1 \times c_1) \quad (4)$$

$$P_{conv} = c_2 \times (h \times w \times c_1) \quad (5)$$

By comparing $P_{DSC}$ and $P_{conv}$, it can be found that the DSC effectively decomposed the traditional convolution by separating the spatial filtering from the feature generation mechanism.
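
The saving implied by Equations (4) and (5) can be checked numerically. The sketch below takes c1 as the number of input channels and c2 as the number of output channels; the layer size used in the example (64 input channels, 128 output channels, 3 × 3 kernel) is an illustrative assumption, not a layer taken from the model. Dividing Equation (4) by Equation (5) gives a ratio of 1/c2 + 1/(h × w).

```python
def dsc_params(c1, c2, h, w):
    """Parameter count of a depthwise separable convolution, Equation (4)."""
    depthwise = c1 * (h * w * 1)   # one h x w filter per input channel
    pointwise = c2 * (1 * 1 * c1)  # c2 filters of size 1 x 1 x c1
    return depthwise + pointwise


def conv_params(c1, c2, h, w):
    """Parameter count of a traditional convolution, Equation (5)."""
    return c2 * (h * w * c1)


if __name__ == "__main__":
    c1, c2, h, w = 64, 128, 3, 3  # illustrative layer size (assumption)
    p_dsc = dsc_params(c1, c2, h, w)
    p_conv = conv_params(c1, c2, h, w)
    print(p_dsc, p_conv, p_dsc / p_conv)
```

For this example the DSC needs 8768 parameters against 73728 for the traditional convolution, roughly 12% of the original count. Bias terms are ignored in both counts.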


Fig. 6 Depthwise separable convolution structure (corresponding to DSC in Fig. 4)


For the specific parameters of the backbone, multi-modal images with 640 × 640 resolution were chosen as the model input. The shallow feature information was aggregated through a Focus module, a two-layer DSC, a CSP_CA_1 module, and a CSP_CA_3 module, and the feature dimension was converted to 128 × 128 × 64. Then, additional features were extracted through the two-layer CSP_CA_3 module, a two-layer DSC, and an SPP module. Three adequate feature levels were obtained: the first two focused on small-scale and medium-scale features, and the last one focused on large-scale features. Then, the features were transferred to the neck part.

In the prediction head part, the k-means clustering algorithm was used to find the anchor boxes, and complete IoU (CIoU) was used as the model's bounding box regression loss. CIoU takes three geometric properties into account, namely overlap area, central point distance, and aspect ratio, which leads to faster convergence and better performance. The formulae are given in Equations (6) to (8).

$$CIoU_{Loss} = 1 - IoU + \frac{\rho^2(b, b^*)}{c^2} + \alpha v \quad (6)$$

$$v = \frac{4}{\pi^2}\left(\arctan\frac{w^*}{h^*} - \arctan\frac{w}{h}\right)^2 \quad (7)$$

$$\alpha = \frac{v}{(1 - IoU) + v} \quad (8)$$

where c represented the diagonal distance of the smallest enclosing area that could contain both the predicted bounding box and the ground truth bounding box; $\rho^2(b, b^*)$ represented the squared Euclidean distance between the center points of the predicted box and the ground truth box; $w$, $h$ and $w^*$, $h^*$ were the widths and heights of the predicted and ground truth boxes, respectively; IoU represented a number from 0 to 1 that specifies the amount of overlap between the predicted and ground truth bounding boxes; $CIoU_{Loss}$ was used to obtain the corresponding loss.
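
The CIoU loss for one pair of boxes can be computed directly from the three geometric terms (overlap, center distance, aspect ratio). The sketch below follows the standard CIoU definition and assumes corner-format boxes (x1, y1, x2, y2) with positive width and height; the function name and the small `eps` guard against division by zero are illustrative choices, not details from this work.

```python
import math


def ciou_loss(pred, gt, eps=1e-9):
    """CIoU loss for two boxes given as (x1, y1, x2, y2) with positive size."""
    px1, py1, px2, py2 = pred
    gx1, gy1, gx2, gy2 = gt
    pw, ph = px2 - px1, py2 - py1
    gw, gh = gx2 - gx1, gy2 - gy1

    # Overlap term: intersection over union.
    inter_w = max(0.0, min(px2, gx2) - max(px1, gx1))
    inter_h = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = inter_w * inter_h
    union = pw * ph + gw * gh - inter
    iou = inter / (union + eps)

    # Distance term: squared center distance over squared enclosing-box diagonal.
    rho2 = ((px1 + px2) / 2 - (gx1 + gx2) / 2) ** 2 + ((py1 + py2) / 2 - (gy1 + gy2) / 2) ** 2
    c2 = (max(px2, gx2) - min(px1, gx1)) ** 2 + (max(py2, gy2) - min(py1, gy1)) ** 2 + eps

    # Aspect-ratio term v and its trade-off weight alpha.
    v = (4 / math.pi ** 2) * (math.atan(gw / gh) - math.atan(pw / ph)) ** 2
    alpha = v / ((1 - iou) + v + eps)

    return 1 - iou + rho2 / c2 + alpha * v


if __name__ == "__main__":
    print(ciou_loss((0, 0, 2, 2), (1, 1, 3, 3)))
```

For the two unit-offset squares in the example, the aspect-ratio term vanishes and the loss reduces to 1 - 1/7 + 2/18, showing that the center-distance penalty still contributes when the shapes already match.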
By improving the backbone and neck network 
of the model and introducing CIoU loss function in 
the prediction head part, the size of the model was 
decreased, and the perception ability of the fruit de- 
tection model was improved, further enhancing the 
performance of detecting in-field peaches. The final 
output of the fruit detection model was the coordi- 
nate information of the peach targets (the prediction 
box of peach position) and the confidence level to a 
specific class, including NO, OL, OB, and OF. 


2.4 Model deployment 


The PyTorch framework was used to train the network, and the model was generated in PTH format. After training, the model was deployed on NVIDIA Jetson Nano to further evaluate its potential for real-time detection. Jetson Nano supports TensorRT, which accelerates the model and improves the processing speed of neural networks by optimizing the network structure. First, the PTH-format model was converted to ONNX format, an intermediate representation that bridges the PyTorch model and the TensorRT model. Then, the ONNX-format model was converted to TensorRT format and tested on Jetson Nano. After the model was deployed on Jetson Nano, the time consumption was measured when using multi-modal images to detect naked and bagging peaches.


3 Experimental results and analysis


To thoroughly evaluate the performance of the improved YOLOv5s using multi-modal images and explore the contribution of each imaging modality when detecting multi-class naked and bagging peaches in natural orchards, different combinations of multi-modal images were input into the improved YOLOv5s. The performance of the model was evaluated in terms of precision (P), recall (R), mean average precision (mAP), and detection speed. First, to evaluate the effect of multi-modal images on model generalization, the improved YOLOv5s was trained and validated on different combinations of imaging modalities, and the test results on naked and bagging peach detection were analyzed quantitatively. Second, to explore the contribution of each imaging modality to peach detection in different orchard environments, the detection results for naked and bagging peaches were compared and analyzed in several typical orchard scenarios, e.g., different illumination conditions, fruit occlusion levels, and camera distances. Finally, an ablation study was conducted to verify the effectiveness of the coordinate attention mechanism and the depthwise separable convolution.


3.1 Training platform and parameters


The deep learning framework used in this study was PyTorch 1.11.0. The training and testing platform was a server with an Intel Xeon Gold 5118 @ 2.30 GHz 12-core CPU and one NVIDIA RTX 2080 Ti (1620 MHz) GPU with 4352 CUDA cores and 11 GB of memory, running the CentOS 7.9 system. The software tools included CUDA 11.2, cuDNN 7.6.5, and Python 3.7. Table 2 shows the network initialization parameters. All input images were resized to 640 × 640 pixels to match the input required by the network. Considering the memory constraints of the server, the batch size was set to eight. The model was trained for 150 epochs to better analyze the training process. Momentum, learning rate, weight decay, and the other parameters followed those of the original YOLOv5s model.


Table 2 Initialization training parameters

Input image size   Batch size   Momentum   Learning rate   Decay    Epochs
640 × 640          8            0.9        0.001           0.0005   150


3.2 Evaluation indicators 


The performance of the model was evaluated by measuring the average precision (AP), mAP, and detection speed. AP was estimated from precision (P) and recall (R), indicating the sensitivity of the network to target detection, and it was also an index that reflected the performance of the improved YOLOv5s model. P and R were defined as Equations (9) and (10).

$$P = \frac{TP}{TP + FP} \quad (9)$$

$$R = \frac{TP}{TP + FN} \quad (10)$$

where TP (True Positive) was the number of targets correctly detected; FP (False Positive) was the number of targets detected with an incorrect classification; FN (False Negative) was the number of targets that were missed. The AP, defined in Equation (11), was the area under the P-R curve. The mAP, defined in Equation (12), was the average value of AP over the N classes.

$$AP = \int_0^1 P(R)\,dR \quad (11)$$

$$mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i \quad (12)$$
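
Equations (9) to (12) can be illustrated with a short pure-Python sketch. The text does not state how the integral in Equation (11) was approximated, so the version below uses the monotone precision envelope and all-point summation that are common in detection benchmarks; this choice, the function names, and the (confidence, is_true_positive) input format are assumptions.

```python
def precision(tp, fp):
    """Equation (9)."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Equation (10)."""
    return tp / (tp + fn)


def average_precision(detections, num_gt):
    """Area under the P-R curve, Equation (11).

    detections: list of (confidence, is_true_positive) pairs for one class.
    num_gt: number of ground truth targets of that class.
    """
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_gt)
    # Make precision non-increasing with recall (precision envelope).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum rectangle areas wherever recall increases.
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(ap_per_class):
    """Equation (12): mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)


if __name__ == "__main__":
    dets = [(0.9, True), (0.8, False), (0.7, True)]
    print(average_precision(dets, num_gt=2))
```

In this work the per-class AP values would correspond to the four annotated classes (NO, OL, OF, and OB), and mAP is their mean.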


3.3 Performances of the improved YOLOv5s using different multi-modal images


To evaluate the performance of the improved YOLOv5s using multi-modal images in detecting multi-class naked and bagging peaches, the improved YOLOv5s was trained, validated, and tested on different combinations of imaging modalities in this section. The datasets used in this section were the multi-class naked peach dataset and the bagging peach dataset described in Section 2.2. The multi-class naked peach dataset included 2077 pairs of multi-modal images and was randomly divided into three parts: training (70%), validation (10%), and testing (20%). The dataset used for training the bagging peach detection model was the bagging peach dataset, with 1454, 208, and 415 pairs of multi-modal images for training, validation, and testing, respectively. The training and validation sets were used to train the models and to determine whether and when the model started to overfit based on the training and validation curves. Then, the test set was analyzed quantitatively to evaluate the final performance of the model using different multi-modal images. In this study, a multi-modal image means a set of RGB, Depth, and IR images that are channel-fused into an image with four or five channels. For example, RGB + Depth and RGB + IR mean stacking an RGB image and the corresponding Depth or IR image to obtain a multi-modal image with four channels. Similarly, RGB + Depth + IR denotes the fusion of RGB, Depth, and IR images into a five-channel image by channel stacking. As a result, the datasets "RGB", "RGB + Depth", "RGB + IR", and "RGB + Depth + IR" contained the same number of images; the only difference was the number of image channels at the input of the detection models.
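
Channel stacking of pixel-aligned modalities can be sketched without any array library by appending the Depth and IR values to each RGB pixel. The function name and the nested-list image layout ([row][column][channel]) are illustrative assumptions; with an array library the same operation is a single concatenation along the channel axis.

```python
def stack_channels(rgb, depth=None, ir=None):
    """Fuse pixel-aligned modalities into one multi-channel image.

    rgb: H x W nested list of [R, G, B] pixels.
    depth, ir: optional H x W nested lists of single values, aligned with rgb.
    Returns an H x W nested list whose pixels have 3, 4, or 5 channels.
    """
    for extra in (depth, ir):
        if extra is not None:
            same_shape = len(extra) == len(rgb) and all(
                len(extra_row) == len(rgb_row) for extra_row, rgb_row in zip(extra, rgb)
            )
            if not same_shape:
                raise ValueError("all modalities must share the same height and width")
    fused = []
    for i, row in enumerate(rgb):
        fused_row = []
        for j, pixel in enumerate(row):
            channels = list(pixel)
            if depth is not None:
                channels.append(depth[i][j])
            if ir is not None:
                channels.append(ir[i][j])
            fused_row.append(channels)
        fused.append(fused_row)
    return fused


if __name__ == "__main__":
    rgb = [[[10, 20, 30], [40, 50, 60]]]  # a 1 x 2 RGB image
    depth = [[7, 8]]
    ir = [[1, 2]]
    print(stack_channels(rgb, depth, ir))
```

Because only the channel count changes, the same annotation files can be reused for every modality combination, and only the first convolution of the detector needs to accept three, four, or five input channels.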

3.3.1 Training assessment 


From the validation curves in Fig. 7(a), it is apparent that the proposed model was not overfitted during the training process. For the training curves, the loss function reached lower values when using the RGB+Depth+IR combination (plotted in brown), while the fastest convergence among all models appeared when using only the RGB modality (plotted in red). When using an additional modality such as infrared (plotted in cyan) or depth (plotted in purple), the model showed lower loss values than when using only the RGB modality. For the validation curves, the validation loss and the training loss were very close to each other after the model converged, indicating that the model learned accurate feature information of the naked peach targets.


train RGB ---- val RGB 

train RGB+Depth - - val RGB+Depth 
train RGBHR -= - - val RGB+IR 

train RGB+DepthHR - - - - val RGB+Depth+R| 


0 50 100 
Epoch 


(b) The loss curves of bagging peaches detection 


Fig. 7 The loss curves under different combinations of imaging modalities for naked and bagging peaches detection 


However, unlike the naked peach detector, Fig. 7(b) shows that the proposed model started to overfit earlier when only RGB images were used than when additional modalities (infrared, depth, or infrared + depth) were introduced. The model converged after about 100 epochs and then started to overfit. The RGB-only model converged fastest but suffered from severe overfitting. Specifically, for the training curves, the loss reached lower values when only the RGB modality was used, whereas the opposite occurred for the validation losses. The reason is that the RGB-only model overfitted, while introducing the infrared and depth modalities made training more resistant to overfitting at the expense of a few more iterations. Compared with the naked peaches, the infrared and depth modalities contributed more to improving generalization and alleviating overfitting, and the best result was achieved when all imaging modalities were used.

Based on the above results, it can be concluded that the infrared and depth modalities helped to improve the generalization ability of the model. As expected, the best results were achieved with five-channel images, namely all imaging modalities.
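The way overfitting is read off the training and validation curves can be sketched in a few lines. This is only an illustrative heuristic, not the procedure used in the study: the function name `overfitting_onset`, the `patience` parameter, and the loss values are all hypothetical.

```python
def overfitting_onset(train_loss, val_loss, patience=5):
    """Return the epoch with the lowest validation loss if the model kept
    training for at least `patience` epochs afterwards while the training
    loss continued to fall; otherwise return None (no overfitting flagged)."""
    if len(train_loss) != len(val_loss) or not val_loss:
        raise ValueError("loss curves must be non-empty and of equal length")
    best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__)
    epochs_since_best = len(val_loss) - 1 - best_epoch
    still_fitting = train_loss[-1] < train_loss[best_epoch]
    if epochs_since_best >= patience and still_fitting:
        return best_epoch
    return None


# Made-up curves: validation loss turns upward after epoch 2.
train = [0.12, 0.09, 0.07, 0.06, 0.05, 0.04, 0.03]
val_overfit = [0.13, 0.10, 0.08, 0.085, 0.09, 0.10, 0.11]
val_healthy = [0.13, 0.10, 0.08, 0.07, 0.065, 0.06, 0.058]

print(overfitting_onset(train, val_overfit, patience=3))  # 2
print(overfitting_onset(train, val_healthy, patience=3))  # None
```

A curve pair like `val_healthy` corresponds to the behavior reported for naked peaches, where the two losses stay close after convergence.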

3.3.2 Quantitative analysis for test results of 
different modal images 

Regarding the test set, Fig. 8(a) and (b) present the mAP of naked and bagging peaches versus epochs for the different combinations of imaging modalities. Table 3 presents the naked and bagging peach detection results for the four combinations of imaging modalities in detail. Comparing the results from RGB images and four-channel images (Table 3, rows 1 to 3), the RGB images with the additional infrared modality offered the best performance, with mAP of 94.7% and 78.2% for naked and bagging peaches, followed by the RGB images alone, with mAP of 93.3% and 72.4%, respectively. The least valuable combination was RGB with the depth modality, which was even less effective than using RGB images alone. The best fruit detection results were obtained when all modalities were combined, achieving mAP of 98.6% and 88.9% for naked and bagging peaches. The most important benefit of introducing the infrared and depth modalities was found in the precision metric for bagging peach detection, which increased by 19.1 percentage points, from 69.2% (RGB) to 88.3% (RGB+Depth+IR). This is because the extra geometric information provided by the infrared and depth modalities was advantageous in reducing false positives. The recall metric also increased when the infrared and depth modalities were introduced, but not as significantly as the precision metric. With the RGB+Depth+IR combination, the mAP improved by 5.3 and 16.5 percentage points over using only RGB images in detecting naked and bagging peaches, respectively.

Fig. 8 The mAP curves under different combinations of imaging modalities for naked and bagging peaches detection: (a) the mAP curves of naked peaches in the test set; (b) the mAP curves of bagging peaches in the test set. Each panel plots mAP against epoch for RGB, RGB+Depth, RGB+IR, and RGB+Depth+IR.
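The gains reported for the five-channel input are absolute differences between the RGB and RGB+Depth+IR rows of Table 3. The short sketch below recomputes them from the reported values; the dictionary layout and the helper name `gain` are illustrative choices, not part of the original work.

```python
# Values copied from Table 3 (percent).
RGB_ONLY = {
    "naked": {"precision": 90.4, "recall": 92.2, "mAP": 93.3},
    "bagging": {"precision": 69.2, "recall": 71.5, "mAP": 72.4},
}
ALL_MODALITIES = {
    "naked": {"precision": 97.4, "recall": 98.2, "mAP": 98.6},
    "bagging": {"precision": 88.3, "recall": 85.4, "mAP": 88.9},
}


def gain(peach_type, metric):
    """Absolute gain (percentage points) of RGB+Depth+IR over RGB only."""
    return round(ALL_MODALITIES[peach_type][metric] - RGB_ONLY[peach_type][metric], 1)


for peach_type in ("naked", "bagging"):
    for metric in ("precision", "recall", "mAP"):
        print(f"{peach_type:8s} {metric:9s} +{gain(peach_type, metric)}")
```

For bagging peaches this gives +19.1 for precision, +13.9 for recall, and +16.5 for mAP, which matches the observation that precision benefited more than recall.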


Between the infrared and depth modalities, it should be emphasized that the former brought a larger improvement in mAP. Additionally, regarding the inference speed of the multi-modal detection model, the inference time per image increased only slightly with the number of channels. This can be explained by the fact that the additional image channels only affected the first layer of the convolutional network; consequently, the increased computation cost was negligible for the whole network. Taken together, these results demonstrated that the three-dimensional spatial geometry and backscattered signal intensity information provided by the infrared and depth modalities could effectively improve fruit detection accuracy, especially for bagging peach detection. Last but not least, the aforementioned detection results showed that detecting bagging peaches was consistently more challenging than detecting naked peaches in orchards.

Table 3 Naked and bagging peach detection results when using four combinations of different imaging modalities

                 Naked peaches                     Bagging peaches                   Detection
Channels         Precision/%  Recall/%  mAP/%     Precision/%  Recall/%     mAP/%   speed/FPS
RGB              90.4         92.2      93.3      69.2         71.5         72.4    70.9
RGB+Depth        92.1         91.5      92.7      68.7         [illegible]  72.3    68.0
RGB+IR           92.7         94.3      94.7      77.8         71.7         78.2    68.0
RGB+Depth+IR     97.4         98.2      98.6      88.3         85.4         88.9    66.2
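The argument that extra input channels only touch the first convolutional layer can be made concrete by counting weights. The sketch below is a rough estimate under stated assumptions: the first-layer shape (32 output channels, 6 x 6 kernel) is hypothetical and not taken from the paper, while the total of 5.08 M parameters is the value reported for the improved YOLOv5s in Table 4.

```python
def conv2d_weight_count(in_channels, out_channels, kernel_size):
    """Number of weights in a bias-free 2D convolution with a square kernel."""
    return in_channels * out_channels * kernel_size * kernel_size


# Hypothetical first-layer shape, assumed for illustration only.
OUT_CHANNELS, KERNEL = 32, 6
TOTAL_PARAMS = 5.08e6  # improved YOLOv5s, as reported in Table 4

baseline = conv2d_weight_count(3, OUT_CHANNELS, KERNEL)
for name, channels in [("RGB", 3), ("RGB+Depth", 4), ("RGB+IR", 4), ("RGB+Depth+IR", 5)]:
    weights = conv2d_weight_count(channels, OUT_CHANNELS, KERNEL)
    extra = weights - baseline
    share = 100 * extra / TOTAL_PARAMS
    print(f"{name:13s} first-layer weights: {weights:5d} (+{extra}, {share:.3f}% of all parameters)")
```

Under these assumptions the five-channel input adds 2304 weights, well below 0.1% of the model. The multiply-accumulate count of that single layer grows in the same proportion, which is consistent with the small FPS differences in Table 3.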

The improved YOLOv5s model was optimized with TensorRT to increase the inference speed on the Jetson Nano board. TensorRT supports three precisions for optimization: floating-point 32 (FP32), floating-point 16 (FP16), and integer 8 (INT8). Since the Jetson Nano does not support INT8 optimization, the model was converted to FP32 and FP16 operations, resulting in detection speeds of 14 and 19 FPS on the test set, respectively; that is, 14 and 19 detections per second on five-channel multi-modal images. Therefore, the improved YOLOv5s model optimized with TensorRT at FP16 precision was selected for deployment on the Jetson Nano development board, which was adequate for computer-vision-based peach detection and harvesting.
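The reported frame rates can be converted into per-image latency to show what the FP16 conversion buys on the embedded board. This is plain arithmetic on the two reported values (14 and 19 FPS); the function name `frame_time_ms` is an illustrative choice.

```python
def frame_time_ms(fps):
    """Average processing time per image in milliseconds for a given frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


FP32_FPS, FP16_FPS = 14, 19  # detection speeds reported on the Jetson Nano

print(f"FP32: {frame_time_ms(FP32_FPS):.2f} ms per image")  # 71.43 ms
print(f"FP16: {frame_time_ms(FP16_FPS):.2f} ms per image")  # 52.63 ms
print(f"FP16 speed-up over FP32: {FP16_FPS / FP32_FPS:.2f}x")  # 1.36x
```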


3.4 Contribution of different imaging modalities in typical scenarios

To explore the contribution of each imaging modality to peach detection under different orchard environments, the visualized test-set results of naked and bagging peach detection under typical orchard scenarios were analyzed. More specifically, real orchard environments involve different illumination conditions, occlusion levels, and camera distances. The contribution of each imaging modality was analyzed by comparing different combinations of imaging modalities (RGB, RGB+Depth, RGB+IR, and RGB+Depth+IR) for the detection of naked and bagging peaches under different scenarios. Note that, in the case of concurrent fruit occlusion, the model output follows the labeling rules in Section 2.2, meaning that the priority sequence of the fruit classes is OB, OF, OL, and NO, from highest to lowest.
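The priority rule for concurrent occlusion can be expressed as a small lookup. The sketch below assumes each peach comes with the set of class labels that apply to it; the function name `resolve_label` and the input format are hypothetical, and only the ordering OB, OF, OL, NO is taken from the text.

```python
# Class priority from highest to lowest, as stated in the labeling rules.
PRIORITY = ("OB", "OF", "OL", "NO")


def resolve_label(candidate_labels):
    """Pick the highest-priority class among the labels that apply to one fruit."""
    candidates = list(candidate_labels)
    unknown = [label for label in candidates if label not in PRIORITY]
    if unknown:
        raise ValueError(f"unknown class label(s): {unknown}")
    if not candidates:
        return "NO"  # assumption: no recorded occlusion maps to the NO class
    return min(candidates, key=PRIORITY.index)


print(resolve_label({"OL", "OF"}))        # OF
print(resolve_label({"OL", "OB", "OF"}))  # OB
print(resolve_label(["OL"]))              # OL
```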

3.4.1 Comparison with different illumination conditions

Fig. 9 shows the detection results of multi-class naked and bagging peaches using different modality combinations (RGB, RGB+Depth, RGB+IR, and RGB+Depth+IR) under three typical illumination conditions in the test set. The peach trees under the three conditions were about 1 m from the camera. For each condition, four detection results are presented according to the input data type: RGB (first row), RGB+Depth (second row), RGB+IR (third row), and RGB, Depth, and IR used simultaneously (fourth row). Odd columns show naked peaches, and even columns show bagging peaches.

Note: Missing peaches were marked in red and false detections were in blue; fruits inside white, green, cyan, and brown boxes referred to the NO, OL, OF, and OB, respectively

Fig. 9 Examples of multi-class naked and bagging peach detection results in the test set when using different modality combinations under three typical illumination conditions of Normal illumination, Strong illumination and Artificial illumination


Under Normal and Strong illumination, the detection performance with RGB+Depth images was even worse than with RGB images alone, with more missed detections. The reason is that the ToF-based depth camera was prone to noise interference from sunlight exposure in the outdoor environment, and its accuracy decreased as the measurement distance increased. As a result, fusing the noisy depth images with the RGB images was likely to mislead the detection model. Although the depth images were not suitable for peach detection under direct sunlight, they did contribute to better detection results under artificial illumination. At nighttime, the depth camera was not disturbed by sunlight noise and helped to reconstruct the peach shape accurately. In particular, for some OB and OF peaches, the RGB images showed a colorless, invisible region at the peach edge, as well as similar colors of peaches and leaves, whereas the depth images showed highly distinctive geometric features. Hence, the geometric features of the OB and OF peaches in the depth images were more distinguishable than those in the RGB images, making these peaches easier to detect at nighttime. Meanwhile, when comparing the results before and after adding the infrared modality in the natural daytime environment, a reduction in false positives of NO and OL peaches was observed, especially in detecting bagging peaches. One possible explanation is that the infrared intensities of fruits and leaves differed significantly in the daytime. Thus, the infrared images could effectively help to distinguish fruits from the background in images acquired under bright illumination.

Therefore, it can be concluded that adding depth images could help to reduce false and missed detections under artificial illumination, whereas adding infrared images could improve peach detection accuracy under bright illumination compared with using only RGB images. Overall, when all imaging modalities were used simultaneously, the best results were obtained under every illumination condition.

3.4.2 Comparison with different occlusion levels

As shown in Fig. 10, further experiments were conducted to analyze the contribution of the different imaging modalities to multi-class naked and bagging peach detection at different occlusion levels. As mentioned in Section 2.1, based on the proportion of peaches in the image occluded by branches and leaves, the occlusion levels were categorized as Slight occlusion, General occlusion, and Severe occlusion.

Note: Missing bagging peaches were marked in red and false detections were in blue; fruits inside white, green, cyan, and brown boxes referred to the NO, OL, OF, and OB, respectively

Fig. 10 Examples of multi-class naked and bagging peach detection results in the test set when using different modality combinations in different occlusion scenes


Under Slight occlusion (first and second columns), the detector missed some peaches when only RGB images were used, whereas all the naked and bagging peaches were accurately detected after further fusing Depth or Infrared images. Because the bagging peaches were wrapped in similarly colored bags, several overlapping peaches, such as OF peaches, were not correctly detected by the RGB detector. Meanwhile, as can be seen in the first column, naked peaches that were occluded and overlapping could be correctly detected after further fusing Depth or Infrared images. Under General occlusion (third and fourth columns), the fusion of infrared and depth channels provided more deep features of the peach targets, and the detection results with five-channel images showed a more significant improvement in accuracy and recall than those with the other three modality combinations. The depth and infrared images could offer additional fruit geometry features that differed from those in RGB images, such as fruit edges, shape, and distance, enabling accurate fruit detection despite leaf or branch occlusion. It should be noted that the multi-modal images were more effective in improving the detection accuracy of bagging peaches than of naked peaches, significantly reducing the rate of missed and false detections. Similarly, under Severe occlusion (fifth and sixth columns), although some naked and bagging peaches were still missed even with five-channel images, the detection results were still significantly better than those with the other modality combinations.

Hence, it can be concluded that the introduction of infrared and depth modalities could provide the model with more valuable information, e.g., geometric features, consequently improving the accuracy and recall rate of fruit detection, even in the case of severely occluded peaches.

3.4.3 Comparison with different camera distances

Fig. 11 presents the effect of different imaging modalities on multi-class naked and bagging peach detection at different camera distances. As mentioned in Section 2.1, a distance of less than 0.3 m from the tree trunk to the camera was considered Close distance, a distance between 0.3 m and 1 m was considered Average distance, and Far distance referred to a distance greater than 1 m from the tree trunk.

Note: Missing bagging peaches were marked in red and false detections were in blue; fruits inside white, green, cyan, and brown boxes referred to the NO, OL, OF, and OB, respectively

Fig. 11 Examples of multi-class naked and bagging peach detection results in the test set when using different modality combinations in different camera distances
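The three distance categories can be written as a simple threshold rule. The thresholds (0.3 m and 1 m) come from the text; how the exact boundary values are assigned is an assumption here, as is the function name `distance_category`.

```python
def distance_category(distance_m):
    """Map a camera-to-trunk distance in metres to the scene category."""
    if distance_m < 0:
        raise ValueError("distance must be non-negative")
    if distance_m < 0.3:
        return "Close"
    if distance_m <= 1.0:  # assumption: both 0.3 m and 1 m count as Average
        return "Average"
    return "Far"


for d in (0.2, 0.3, 0.8, 1.0, 1.5):
    print(f"{d:.1f} m -> {distance_category(d)}")
```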


The Close distance scene is shown in the first and second columns of Fig. 11, where some peaches were less than 0.2 m from the camera and the others were within 0.3 m. As can be seen, the model using only RGB images achieved the best detection results for both naked and bagging peaches; in contrast, the model produced a large number of missed and false detections after fusing depth and infrared images. Worse still, the peaches within 0.2 m could not be detected accurately. The reason is that the depth information in this work was obtained with the ToF mechanism, and the Azure Kinect DK camera has operating distance requirements. As can be seen in the first and second columns of Fig. 11, when the distance between the camera and the peaches was outside the camera operating range, the depth and infrared information of the peaches was lost, which negatively affected detection after fusion with the RGB images. Similar results occurred in Far distance detection, where some peaches that were far from the camera and severely occluded could not be detected even when five-channel images were used. When the camera operated at Average distance, that is, 0.3 m to 1 m from the tree trunk, the best detection results were obtained with the combination of five channels.

Therefore, an appropriate camera operating distance is required if one intends to improve peach detection accuracy by introducing infrared and depth modalities. In conclusion, the introduction of infrared and depth channels could improve the detection accuracy of occluded fruits, but only when the camera operated within an appropriate distance.


3.5 Ablation experiments of the improved YOLOv5s

To demonstrate the effectiveness of the improvements to YOLOv5s, this section presents an ablation study of the improved YOLOv5s using all imaging modalities for multi-class peach detection. Specifically, the comparison contains a baseline and three other cases. The baseline model was the original YOLOv5s without the attention mechanism and the depthwise separable convolution. Then, the coordinate attention (CA) mechanism and the depthwise separable convolution (DSC) were integrated into YOLOv5s separately, to enhance the learning of important information and to reduce the number of model parameters, respectively. The network that used only DSC in YOLOv5s was denoted as YOLOv5s-DSC, while the network that used only CA was denoted as YOLOv5s-CA. All models were compared using all imaging modalities, meaning that the same RGB training dataset as in Section 3.3, together with the corresponding infrared and depth images, was used.

As summarized in Table 4, four comparison experiments were carried out to investigate the performance of the CA and DSC modules.
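Why the DSC module reduces parameters can be seen by counting the weights of a single layer. The sketch below uses the standard textbook formulas for a regular convolution and for a depthwise convolution followed by a 1 x 1 pointwise convolution; the layer shape (128 input channels, 256 output channels, 3 x 3 kernel) is a hypothetical example, not a layer taken from the improved YOLOv5s.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights of a regular k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k):
    """Weights of a depthwise k x k convolution plus a 1 x 1 pointwise convolution."""
    depthwise = k * k * c_in  # one k x k filter per input channel
    pointwise = c_in * c_out  # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


c_in, c_out, k = 128, 256, 3  # hypothetical layer shape
std = standard_conv_params(c_in, c_out, k)
dsc = depthwise_separable_params(c_in, c_out, k)
print(std, dsc, round(dsc / std, 4))  # 294912 33920 0.115
```

The ratio equals 1/c_out + 1/k^2, so the saving for a single layer approaches a factor of k^2 as the number of output channels grows. The reduction for a whole network is smaller because only some layers are replaced.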


Table 4 Detection results of different models in the test set of naked and bagging peaches

                    Naked peaches                     Bagging peaches
Models              Precision/%  Recall/%  mAP/%     Precision/%  Recall/%  mAP/%    Parameters/M  Detection speed/FPS
YOLOv5s             96.5         94.2      95.8      81.7         89.6      82.7     7.07          66.2
YOLOv5s-DSC         95.4         93.5      95.0      80.0         78.0      80.0     5.03          94.3
YOLOv5s-CA          97.7         98.3      98.8      88.5         85.1      89.4     7.12          66.2
Improved YOLOv5s    97.4         98.2      98.6      88.3         85.4      88.9     5.08          77.5


As can be seen from Table 4, when the attention mechanism was embedded into YOLOv5s, the mAP of YOLOv5s-CA was 98.8% for naked peaches, 3 percentage points higher than that of YOLOv5s and the highest among all models. Unexpectedly, after substituting DSC for the regular convolution, the mAP of YOLOv5s-DSC was 95.0%, slightly lower than that of YOLOv5s. It should also be noted that the mAP of YOLOv5s-CA and YOLOv5s-DSC for bagging peaches was 89.4% and 80.0%, respectively. With regard to model parameters, YOLOv5s-DSC had 5.03 M, the fewest among all models and 39.9% fewer than the original YOLOv5s. In terms of detection speed, the YOLOv5s-DSC model was 30.5% faster than YOLOv5s, indicating that the DSC module was more cost-effective than regular convolution. Notably, the detection speed of the YOLOv5s-CA model was the same as that of the original YOLOv5s, verifying that the coordinate attention module could enhance feature extraction ability without significantly increasing the model parameters. After further fusing the DSC and CA modules, the improved YOLOv5s model recorded better mAP than the original YOLOv5s and YOLOv5s-DSC, increasing by 2.8 and 6.2 percentage points over YOLOv5s on naked and bagging peach detection, respectively. Note that the mAP of the improved YOLOv5s model decreased very slightly in naked and bagging peach detection compared with YOLOv5s-CA. However, the improved YOLOv5s achieved a detection speed of 77.5 FPS with fewer parameters, 14.6% faster than YOLOv5s-CA.

Overall, these experimental results demonstrated that the introduced CA and DSC were effective in improving the detection accuracy and reducing the computational cost of YOLOv5s, as the proposed model could detect naked and bagging peaches in orchards with faster speed and higher accuracy while requiring fewer parameters.

3.6 Comparison and discussion

3.6.1 Comparison with other object detection networks

To further analyze the performance of the improved YOLOv5s network, it was compared with three lightweight object detection networks: YOLOX-Nano[36], PP-YOLO-Tiny[37], and EfficientDet-D0[38]. The same training, validation, and test sets and the same image channels (five-channel images, RGB+Depth+IR) were used to train and test the three networks. The detection results of these networks on the test set are shown in Table 5.


Table 5 Detection results of different lightweight object detection models in the test set of naked and bagging peaches

                    Naked peaches                     Bagging peaches
Models              Precision/%  Recall/%  mAP/%     Precision/%  Recall/%  mAP/%    Parameters/M  Detection speed/FPS
YOLOX-Nano          76.3         74.1      75.7      71.4         79.2      72.6     0.91          170.2
PP-YOLO-Tiny        78.2         79.9      80.5      80.4         78.8      80.8     1.3           154.5
EfficientDet-D0     85.3         86.9      87.7      86.2         85.1      85.4     3.9           110.7
Improved YOLOv5s    97.4         98.2      98.6      88.3         85.4      88.9     5.08          77.5


As can be seen from Table 5, the improved YOLOv5s achieved the best results in terms of precision, recall, and mAP among the four networks. For naked peaches, the mAP of the improved YOLOv5s was 98.6%, which was 22.9%, 18.1%, and 10.9% higher than those of YOLOX-Nano, PP-YOLO-Tiny, and EfficientDet-D0, respectively. For bagging peaches, the improved YOLOv5s was also the best in terms of mAP, which was 16.3%, 8.1%, and 4.5% higher than YOLOX-Nano, PP-YOLO-Tiny, and EfficientDet-D0, respectively. Although the average detection speed of the improved YOLOv5s on the test set, 77.5 FPS, was lower than that of the other three networks, the detection accuracy was effectively improved.

From the comparison results, it can be concluded that the peach detection network based on the improved YOLOv5s proposed in this study can detect peaches more effectively and accurately than the other lightweight networks.

3.6.2 Comparison with other fruit detection studies

In addition, two open-source fruit datasets, covering apple and kiwifruit, were used to assess the effectiveness of the improved YOLOv5s for fruit detection. Specifically, Gené-Mola et al.[2] presented a Fuji apple dataset, acquired at night using a depth camera, and used Faster R-CNN for apple detection. Suo et al.[18] classified the kiwifruit dataset into five classes based on occlusion status and used YOLOv4 for fruit detection. Since the two datasets had different label classifications and image resolutions, the same label classifications and image resolutions as in the raw datasets were adopted to ensure that the comparisons were made under the same experimental conditions. Parameters such as momentum, learning rate, and weight decay followed those of the original YOLOv5s model. Meanwhile, both datasets were split into training and test sets to train and test the improved YOLOv5s. The experimental results presented in Table 6 revealed that the proposed improved YOLOv5s model offered better results, to different degrees, than the other methods in detecting fruits, verifying its effectiveness on different detection tasks.


Table 6 Detection results of different models on two open-source fruit datasets

Model                                   Precision/%    Recall/%        mAP/%
Faster R-CNN (Fuji apple)               84.7           88.8            92.7
Improved YOLOv5s (Fuji apple)           95.9 (+8.2)    96.3 (+42.5)    98.2 (+5.5)
YOLOv4 (Hayward kiwifruit)              -              -               91.9
Improved YOLOv5s (Hayward kiwifruit)    95.32          94.63           94.2 (+2.3)


4 Conclusions 


It is crucial to develop effective methods for detecting fruits grown under different agronomic measures in order to promote mechanical harvesting. In this paper, a multi-class RGB-D dataset of natural naked and bagging peaches has been made publicly available, being the first multi-class peach detection dataset. An improved multi-class peach detector based on YOLOv5s was presented, which fuses multi-modal images as input and introduces the coordinate attention mechanism and depthwise separable convolution.

The experimental comparison results showed that the improved YOLOv5s using multi-modal visual data achieved detection mAP of 98.6% and 88.9% on naked and bagging peaches in complex illumination and severe occlusion environments, an increase of 5.3 and 16.5 percentage points over using RGB images alone, and of 2.8 and 6.2 percentage points over YOLOv5s. Compared with other networks in detecting bagging peaches, the improved YOLOv5s was the best in terms of mAP, which was 16.3%, 8.1% and 4.5% higher than YOLOX-Nano, PP-YOLO-Tiny, and EfficientDet-D0, respectively. The improved YOLOv5s with multi-modal visual data could enhance the model's perception ability in detecting both naked and bagging peaches under severe occlusion, as well as under various illumination conditions.

In particular, the depth imaging modality could reduce false and missed detections of peach targets under artificial illumination, and the infrared imaging modality could improve the detection accuracy under strong illumination.

Additionally, the proposed detection model reached 19 detections per second with five-channel multi-modal images on a popular embedded platform, which could meet the real-time requirement of a fruit harvesting system.

The main limitation of using five-channel multi-modal images was the underutilization of spatial geometric information in the depth and infrared images. Future work includes the exploration of stronger fruit detection networks and multi-modal image fusion methods to further improve the detection of in-field bagging fruits, including those wrapped in various types of bags.


References: 


[1] YADAV S, SENGAR N, SINGH A, et al. Identification 
of disease using deep learning and evaluation of bacte- 
riosis in peach leaf[J]. Ecological Informatics, 2021, 
61: ID 101247. 


ChinaXivA@ (ERAT! 


103 


202302.00133v1 


chinaXiv 


[2] 


[3] 


[4] 


[5] 


[6] 


[7] 


[8] 


[9] 


[10] 


[11] 


[12] 


[13] 


[14] 


[15] 


GENE-MOLA J, VILAPLANA V, ROSELL-POLO J 
R, et al. Multi-modal deep learning for Fuji apple de- 
tection using RGB-D cameras and their radiometric ca- 
pabilities[J]. Computers and Electronics in Agriculture, 
2019, 162: 689-698. 

NGUYEN T T, VANDEVOORDE K, WOUTERS N, 
et al. Detection of red and bicoloured apples on tree 
with an RGB-D camera[J]. Biosystems Engineering, 
2016, 146: 33-44. 

LIU X, JIA W, RUAN C, et al. The recognition of ap- 
ple fruits in plastic bags based on block classifica- 
tion[J]. Precision Agriculture, 2018, 19(4): 735-749. 
LIU T, EHSANI R, TOUDESHKI A, et al. Identifying 
immature and mature pomelo fruits in trees by ellipti- 
cal model fitting in the Cr-Cb color space[J]. Precision 
Agriculture, 2019, 20(1): 138-156. 

LIU Y, CHEN B, QIAO J. Development of a machine 
vision algorithm for recognition of peach fruit in a nat- 
ural scene[J]. Transactions of the ASABE, 2011, 54(2): 
695-702. 

WILLIAMS H A M, JONES M H, NEJATI M, et al. 
Robotic kiwifruit harvesting using machine vision, con- 
volutional neural networks, and robotic arms[J]. Bio- 
systems Engineering, 2019, 181: 140-156. 

NAVAS E, FERNANDEZ R, SEPULVEDA D, et al. 
Soft grippers for automatic crop harvesting: A re- 
view[J]. Sensors, 2021, 21(8): ID 2689. 

TU S, PANG J, LIU H, et al. Passion fruit detection 
and counting based on multiple scale faster R-CNN us- 
ing RGB-D images[J]. Precision Agriculture, 2020, 21 
(5): 1072-1091. 

HANI N, ROY P, ISLER V. A comparative study of 
fruit detection and counting methods for yield mapping 
in apple orchards[J]. Journal of Field Robotics, 2020, 
37(2): 263-282. 

LU S, CHEN W, ZHANG X, et al. Canopy-attention- 
YOLOv4-based immature/mature apple fruit detection 
on dense-foliage tree architectures for early crop load 
estimation[J]. Computers and Electronics in Agricul- 
ture, 2022, 193: ID 106696. 

LI X, PAN J, XIE F, et al. Fast and accurate green pep- 
per detection in complex backgrounds via an improved 
YOLOv4-tiny model[J]. Computers and Electronics in 
Agriculture, 2021, 191: ID 106503. 

JIANG M, SONG L, WANG Y, et al. Fusion of the YO- 
LOv4 network model and visual attention mechanism 
to detect low-quality young apples in a complex envi- 
ronment[J]. Precision Agriculture, 2022, 23(2): 
559-577. 

HUANG H, HUANG T, LI Z, et al. Design of citrus 
fruit detection system based on mobile platform and 
edge computer device[J]. Sensors, 2021, 22(1): ID 59. 
FU L, GAO F, WU J, et al. Application of consumer 


[16] 


[17] 


[18] 


[19] 


[20] 


[21] 


[22] 


[23] 


[24] 


[25] 


[26] 


[27] 


[28] 


RGB-D cameras for fruit detection and localization in 
field: A critical review[J]. Computers and Electronics 
in Agriculture, 2020, 177: ID 105687. 

SA I, GE Z, DAYOUB F, et al. Deepfruits: A fruit de- 
tection system using deep neural networks[J]. Sensors, 
2016, 16(8): ID 1222. 

ARAD B, BALENDONCK J, BARTH R, et al. Devel- 
opment of a sweet pepper harvesting robot[J]. Journal 
of Field Robotics, 2020, 37(6): 1027-1039. 

SUO R, GAO F, ZHOU Z, et al. Improved multi-class- 
es kiwifruit detection in orchard to avoid collisions dur- 
ing robotic picking[J]. Computers and Electronics in 
Agriculture, 2021, 182: ID 106052. 

TIAN Y, YANG G, WANG Z, et al. Apple detection 
during different growth stages in orchards using the im- 
proved YOLO-v3 model[J]. Computers and Electron- 
ics in Agriculture, 2019, 157: 417-426. 

REDMON J, DIVVALA S, GIRSHICK R, et al. You 
only look once: Unified, real-time object detection[C]// 
Proceedings of the IEEE Conference on Computer Vi- 
sion and Pattern Recognition. Piscataway, New York, 
USA: IEEE, 2016: 779-788. 

REDMON J, FARHADI A. YOLOv3: An incremental 
improvement[J/OL]. arXiv:1804.02767[cs.CV], 2018. 
REDMON J, FARHADI A. YOLO9000: Better, faster, 
stronger[C]// Proceedings of the IEEE Conference on 
Computer Vision and Pattern Recognition. Piscataway, 
New York, USA: IEEE, 2017: 7263-7271. 
BOCHKOVSKIY A, WANG C Y, LIAO H Y M. YO- 
LOv4: Optimal speed and accuracy of object detec- 
tion[J/OL]. arXiv: 2004.10934[cs.CV], 2020. 

YAN B, FAN P, LEI X, et al. A real-time apple targets 
detection method for picking robot based on improved 
YOLOvS[J]. Remote Sensing, 2021, 13(9): ID 1619. 
LIU S, QI L, QIN H, et al. Path aggregation network 
for instance segmentation[C]// Proceedings of the 
IEEE Conference on Computer Vision and Pattern Rec- 
ognition. Piscataway, New York, USA: IEEE, 2018: 
8759-8768. 

HOU Q, ZHOU D, FENG J. Coordinate attention for 
efficient mobile network design[C]// Proceedings of 
the IEEE/CVF Conference on Computer Vision and 
Pattern Recognition. Piscataway, New York, USA: 
IEEE, 2021: 13713-13722. 

FANG L, WU Y, LI Y, et al. Ginger seeding detection 
and shoot orientation discrimination using an improved 
YOLOv4-LITE network[J]. Agronomy, 2021, 11(11): 
ID 2328. 

SHI C, LIN L, SUN J, et al. A lightweight YOLOv5S 
transmission line defect detection method based on 
coordinate attention[C]// 2022 IEEE 6th Information 
Technology and Mechatronics Engineering Conference 
(ITOEC). Piscataway, New York, USA: IEEE, 2022, 6: 
1779-1785. 

ZHA M, QIAN W, YI W, et al. A lightweight YOLOv4- 
based forestry pest detection method using coordinate 
attention and feature fusion[J]. Entropy, 2021, 23(12): 
1587. 

HU J, SHEN L, SUN G. Squeeze-and-excitation 
networks[C]// Proceedings of the IEEE Conference on 
Computer Vision and Pattern Recognition. Piscataway, 
New York, USA: IEEE, 2018: 7132-7141. 

ZHANG Y, YU J, CHEN Y, et al. Real-time strawberry 
detection using deep neural networks on embedded 
system (RTSD-net): An edge AI application[J]. Computers 
and Electronics in Agriculture, 2022, 192: ID 106586. 

CHOLLET F. Xception: Deep learning with depthwise 
separable convolutions[C]// Proceedings of the IEEE 
Conference on Computer Vision and Pattern 
Recognition. Piscataway, New York, USA: IEEE, 2017: 
1251-1258. 

BOCHKOVSKIY A, WANG C Y, LIAO H Y M. 
YOLOv4: Optimal speed and accuracy of object 
detection[J/OL]. arXiv: 2004.10934[cs.CV], 2020. 

ZHENG Z, WANG P, LIU W, et al. Distance-IoU loss: 
Faster and better learning for bounding box 
regression[C]// Proceedings of the AAAI Conference on 
Artificial Intelligence. Piscataway, New York, USA: IEEE, 
2020, 34(7): 12993-13000. 

POWERS D M W. Evaluation: from precision, recall 
and F-measure to ROC, informedness, markedness and 
correlation[J/OL]. arXiv: 2010.16061[cs.LG], 2020. 

GE Z, LIU S, WANG F, et al. YOLOX: Exceeding YOLO 
series in 2021[J/OL]. arXiv: 2107.08430[cs.CV], 2021. 

LONG X, DENG K, WANG G, et al. PP-YOLO: An 
effective and efficient implementation of object 
detector[J/OL]. arXiv: 2007.12099[cs.CV], 2020. 

TAN M, PANG R, LE Q V. EfficientDet: Scalable and 
efficient object detection[C]// Proceedings of the 
IEEE/CVF Conference on Computer Vision and Pattern 
Recognition. Piscataway, New York, USA: IEEE, 2020: 
10781-10790. 


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